Non-interactive OCR Post-correction for Giga-Scale Digitization Projects

Author

  • Martin Reynaert
Abstract

This paper proposes a non-interactive system for reducing the level of OCR-induced typographical variation in large text collections, both contemporary and historical. Text-Induced Corpus Clean-up, or TICCL (pronounced 'tickle'), focuses on high-frequency words derived from the corpus to be cleaned and gathers all typographical variants of any particular focus word that lie within a predefined Levenshtein distance (henceforth: LD). Simple text-induced filtering techniques help to retain as many of the true positives as possible and to discard as many of the false positives as possible. TICCL has been evaluated on a contemporary OCR-ed Dutch text corpus and on a corpus of historical newspaper articles, whose OCR quality is far lower and which is written in an older Dutch spelling. Representative samples of typographical variants from both corpora have allowed us not only to evaluate our system properly, but also to draw conclusions on how to adapt the adopted correction mechanism to OCR-error resolution. The performance scores obtained up to LD 2 mean that the bulk of the undesirable OCR-induced typographical variation present can be removed fully automatically.
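The variant-gathering step described in the abstract can be illustrated with a small sketch. This is not the author's implementation: it is a brute-force illustration that assumes a tokenized corpus, and the function names `levenshtein` and `gather_variants` as well as the parameters `min_focus_freq` and `max_ld` are hypothetical. It treats words at or above a frequency threshold as focus words and collects rarer words within the given Levenshtein distance as candidate variants. The text-induced filtering of false positives mentioned in the abstract is not modelled here.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def gather_variants(tokens, min_focus_freq=3, max_ld=2):
    """Map each frequent focus word to rarer words within max_ld edits.

    Assumption for illustration only: frequent words are taken as correct
    forms and rare near-neighbours as candidate variants.
    """
    counts = Counter(tokens)
    focus_words = [w for w, n in counts.items() if n >= min_focus_freq]
    variants = {}
    for focus in focus_words:
        found = sorted(
            w for w, n in counts.items()
            if n < min_focus_freq and 0 < levenshtein(focus, w) <= max_ld
        )
        if found:
            variants[focus] = found
    return variants


# Toy corpus: one frequent Dutch word plus two OCR-like misreadings of it.
tokens = ["regering"] * 5 + ["regcring", "rcgerlng", "krant", "krant", "kraut"]
print(gather_variants(tokens))
```

Comparing every focus word against every rare word is quadratic in vocabulary size, so a sketch like this only suits small samples; a corpus at the scale the title refers to would need a far more efficient lookup.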


Similar resources

Measuring Search Retrieval Accuracy of Uncorrected OCR: Findings from the Harvard-Radcliffe Online Historical Reference Shelf Digitization Project

This report presents the findings of an investigation to evaluate the conditions for search retrieval successes and failures when using uncorrected OCR for indexing. The purpose of the study was to assess whether low-cost, high-production techniques for text conversion were adequate to produce digital reproductions of consistent quality and usability. We sought to identify attributes of the ori...


Estimating Digitization Costs in Digital Libraries Using DiCoMo

The estimate of digitization costs is a very difficult task. It is difficult to make exact predictions due to the great quantity of unknown factors. However, digitization projects need to have a precise idea of the economic costs and the times involved in the development of their contents. The common practice when we start digitizing a new collection is to set a schedule, and a firm commitment ...


A Document Recognition System for Early Modern Latin

Large-scale digitization of manuscripts is facilitated by high-accuracy optical character recognition (OCR) engines. The focus of our work is on using these tools to digitize Latin texts. Many of the texts in the language, especially the early modern, make heavy use of special characters like ligatures and accented abbreviations. Current OCRs are inadequate for our purpose: their built-in train...


Keep, Change or Delete? Setting up a Low Resource OCR Post-correction Framework for a Digitized Old Finnish Newspaper Collection

There has been a huge interest in digitization of both hand-written and printed historical material in the last 10–15 years and most probably this interest will only increase in the ongoing Digital Humanities era. As a result of the interest we have lots of digital historical document collections available and will have more of them in the future. The National Library of Finland has digitized a...


Automated Error Detection in Digitized Cultural Heritage Documents

The work reported in this paper aims at performance optimization in the digitization of documents pertaining to the cultural heritage domain. A hybrid method is proposed, combining statistical classification algorithms and linguistic knowledge to automatize post-OCR error detection and correction. The current paper deals with the integration of linguistic modules and their impact on error




Publication date: 2008